Home Credit Default Risk

Group 23

TEAM AND PROJECT META INFORMATION

Email IDs:

Raj Chavan : rchavan@iu.edu

Sanket Bailmare: sbailmar@iu.edu

Shefali Luley: sluley@iu.edu

Tanay Kulkarni : tankulk@iu.edu

Group : 23

Members: Raj Chavan Sanket Bailmare Shefali Luley Tanay Kulkarni


PROJECT ABSTRACT

In today’s world, many people struggle to get loans because of insufficient or non-existent credit histories, which often pushes them toward untrustworthy lenders who exploit them. Home Credit works to expand financial inclusion for the unbanked by providing a safe and secure borrowing experience, using a variety of alternative data sources and methods to predict its clients’ repayment abilities. To ensure that this underserved demographic has a favorable loan experience, we use machine learning and statistical methods to make these predictions, so that clients who are capable of repayment are granted loans rather than being rejected. Clients are also given a loan maturity plan and a repayment calendar to help them succeed. Our goal is to work on the Home Credit Default Risk (HCDR) data and apply cleaning, preprocessing, and modeling techniques with pipelining to predict whether a client will repay a loan. After merging all subordinate files, cleaning the data, and performing exploratory data analysis, we apply preprocessing steps that include dividing the data into numerical and categorical features. The next step is creating new features from the existing ones that could serve as good predictors. The project consisted of three phases: the first phase trained the models without feature engineering or hyperparameter tuning; the second phase created 11 new features from the most important existing features and built a modeling pipeline in which we determined the best hyperparameters for each model and selected the best one. The test data was then passed through the best model to obtain the test results. The best model of the second phase was the Random Forest classifier, with a training accuracy of 92.4% and a test ROC score of 0.72814.
The third and final phase covered the creation of a Multi-Layer Perceptron (MLP) neural network using PyTorch. This artificial neural network yielded a training accuracy of 91.95% and a test ROC score of 0.71225.

PROJECT DESCRIPTION

DATA DESCRIPTION

DataSet Link: https://www.kaggle.com/c/home-credit-default-risk/data

Tasks to be tackled:

Phase 3

Approach

Here are some TensorBoard screenshots that we captured for the model:

[TensorBoard screenshots of the model and a rough architecture sketch]

Phase 2

Approach

Diagram: This is a block diagram to understand the workflow of the data. [Workflow block diagram]


EXPLORATORY DATA ANALYSIS + FEATURE ENGINEERING AND TRANSFORMERS

Feature Engineering:
Phase 1


Declaring some functions

Columns with more than 90% zero values in them

Dropping the above columns from the dataset
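A minimal pandas sketch of this filter, with hypothetical column names standing in for the real ones:

```python
import pandas as pd

# Toy frame standing in for the merged application data (column names are hypothetical)
df = pd.DataFrame({
    "FLAG_DOC_A": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],  # exactly 90% zeros: kept
    "FLAG_DOC_B": [0] * 10,                          # 100% zeros: dropped
    "AMT_INCOME": [100, 200, 150, 120, 180, 90, 300, 250, 110, 170],
})

# Fraction of zero values per column; drop columns above the 90% threshold
zero_frac = (df == 0).mean()
cols_to_drop = zero_frac[zero_frac > 0.9].index.tolist()
df = df.drop(columns=cols_to_drop)
```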

Columns in the training set with more than 30% missing data, along with their median/mode and the unique values in each column

Segregating the dataset into numerical and categorical dataframes

Numerical Columns and their correlation with the TARGET column in descending order

The columns NAME_FAMILY_STATUS, CODE_GENDER, and NAME_INCOME_TYPE do not contain the values 'Unknown', 'XNA', and 'Maternity Leave' in the test dataset, so the rows with these values (11 in total) are removed from the training dataset.

Considering columns that have more than 2% correlation with the TARGET variable
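A small sketch of this correlation filter on deterministic toy data (the column names are illustrative; STRONG is built to track TARGET, NOISE to be exactly orthogonal to it):

```python
import pandas as pd

# Deterministic toy data: STRONG tracks TARGET, NOISE is orthogonal to it
df = pd.DataFrame({
    "TARGET": [0, 0, 1, 1] * 50,
    "STRONG": [0.1, 0.2, 0.8, 0.9] * 50,
    "NOISE": [1, -1, 1, -1] * 50,
})

# Absolute correlation of each column with TARGET; keep anything above 2%
corr = df.corr()["TARGET"].drop("TARGET").abs()
selected = corr[corr > 0.02].index.tolist()
```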

Out of these columns, we first need to deal with missing values, so we list all columns that have them

In cols_with_no_missing we keep the columns that have no missing values

We set a threshold of 10 unique values per column: if a column has 10 or fewer, we treat it as discrete rather than continuous and replace its missing values with the column's mode
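A sketch of this thresholding rule on a toy frame (the column names follow the HCDR naming style, but the rows are made up):

```python
import pandas as pd

# CNT_CHILDREN has only a few unique values (discrete); AMT_INCOME is continuous
df = pd.DataFrame({
    "CNT_CHILDREN": [0, 1, 0, None, 2, 0, 1, 0, 0, 2, 1, 0],
    "AMT_INCOME": [100.0, None, 150.0, 120.0, 180.0, 90.0,
                   300.0, 250.0, 110.0, 170.0, 130.0, 140.0],
})

UNIQUE_THRESHOLD = 10
discrete_cols = [c for c in df.columns
                 if df[c].nunique(dropna=True) <= UNIQUE_THRESHOLD]

# Discrete columns get mode imputation; the rest go to the mean-imputing pipeline
for col in discrete_cols:
    df[col] = df[col].fillna(df[col].mode()[0])
```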

Finally, we collect the names of the numerical columns that we will use in modeling and imputation

This gives us the final numerical dataframe

Here we compute the missing-value counts for the categorical columns and see that 6 of the 16 columns have missing values

Getting the counts of each category in these columns

Here we observe that for these columns:

1) FONDKAPREMONT_MODE: 68% of the data is missing, which makes imputing with the mode inadvisable

2) WALLSMATERIAL_MODE: the counts of the first and second most frequent values are close, so imputing with the mode is ambiguous

3) OCCUPATION_TYPE: 31% of the values are missing, and again no single value can reasonably be chosen to impute the missing data

Thus we remove these columns from the categorical part of the dataframe

VISUAL EXPLORATORY DATA ANALYSIS

Visualizing the categorical columns to understand the data more efficiently

What is the distribution of loans by gender?

Inference: The number of female borrowers who have not repaid the loan is comparatively higher than that of male borrowers.

What is the marital status of the clients?

Inference: Married clients make up the majority of those who have not repaid the loan, while the 'Unknown' status is negligible.

What percentage of clients own a car?

Inference: About 50% of clients own a car; slightly more than half do not, and the latter group accounts for most of the unpaid loans.

What type of educational background do the clients have?

Inference: Clients with Academic Degree are more likely to repay the loan compared to others.

What types of housing do the clients stay in?

Inference: From the plot, most clients live in a house/apartment and this group accounts for the majority of unpaid loans, while the numbers living in office apartments and co-op apartments are negligible.

What income types do the loan applicants have?

Inference: The Student and Businessman categories are negligible; working clients account for the majority of unpaid loans.

Inference: Loan applications peak on Tuesday, while the lowest counts clearly occur on the weekends.

What types of loan are available?

Inference: Far more clients take cash loans than revolving loans.

Here, we plot graphs for the columns with the highest correlations with the TARGET variable and observe their trends with respect to it.

Inference: For the columns EXT_SOURCE_3, EXT_SOURCE_2, and EXT_SOURCE_1 we observe a clear, strong negative correlation.

What type of correlation do the columns DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE have with the target?

Inference: From the plots, the columns DAYS_BIRTH and DAYS_LAST_PHONE_CHANGE show a strong positive correlation with the target.

MODELING PIPELINES

[Modeling pipeline diagram]

In this project, we are creating three pipelines, one for numerical data, one for categorical data and finally a pipeline to combine the data.

(i) Numerical data pipeline: for the numerical data ('num_pipeline'), we impute missing values with the column mean.

(ii) Categorical data pipeline: for the categorical data ('cat_pipeline'), we impute missing values with the mode, i.e. the most frequent value.

(iii) Final pipeline: a pipeline that merges the imputed numerical and categorical columns; the categorical columns are also one-hot encoded.

Importing necessary packages

Selecting only the columns that we finally decided on for the numerical and categorical parts

Making two pipelines: one for the numerical data, where we impute missing values with the column mean, and another for the categorical data, where we impute missing values with the mode (the most frequent value).

Here we create a pipeline that merges the imputed numerical and categorical columns; the categorical columns are one-hot encoded.
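A minimal scikit-learn sketch of these three pipelines, with hypothetical column names standing in for the real feature lists:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

num_cols = ["AMT_CREDIT"]            # hypothetical numerical column
cat_cols = ["NAME_CONTRACT_TYPE"]    # hypothetical categorical column

# (i) numerical pipeline: mean imputation
num_pipeline = Pipeline([("impute", SimpleImputer(strategy="mean"))])
# (ii) categorical pipeline: mode imputation followed by one-hot encoding
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
# (iii) final pipeline: merge both branches
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
])

X = pd.DataFrame({
    "AMT_CREDIT": [1000.0, np.nan, 3000.0],
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", np.nan],
})
Xt = full_pipeline.fit_transform(X)  # 1 numerical + 2 one-hot columns
```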

This is the final dataset that we get for training our model

1) Amt credit / Amt annuity, 2) Amt installment / Amt credit, 3) Day_first_due - Day_last_due, 4) Amt application / Amt payment, 5) Amt credit sum / Amt credit, 6) Amt installment / Days installment, 7) Amt credit sum / Days credit
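A pandas sketch of two such ratio features; the column names follow the HCDR naming style, but the rows are made up for illustration:

```python
import pandas as pd

# Illustrative rows; the real columns come from the merged HCDR tables
df = pd.DataFrame({
    "AMT_CREDIT": [500000.0, 300000.0],
    "AMT_ANNUITY": [25000.0, 15000.0],
    "AMT_INSTALMENT": [20000.0, 12000.0],
})

# Ratio features in the spirit of the list above
df["CREDIT_ANNUITY_RATIO"] = df["AMT_CREDIT"] / df["AMT_ANNUITY"]
df["INSTALMENT_CREDIT_RATIO"] = df["AMT_INSTALMENT"] / df["AMT_CREDIT"]
```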

Including only features with 5% or higher correlation

Leakage

Data leakage in machine learning modeling pipelines usually occurs when data that is available during training or feature engineering is unavailable at the time of inference, i.e. when testing the model's accuracy on unseen data. For our pipeline, we see no data leakage: we have dealt with all NaN values appropriately, and the data types have been made uniform across columns. We have also ensured that the new features generated during feature engineering are used appropriately during training and are available at the time of inference.
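One standard way to avoid this kind of leakage is to fit preprocessing statistics on the training split only, then apply them to both splits; a small scikit-learn sketch on synthetic data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [np.nan]])
y = np.array([0, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit the imputer on the training split ONLY, then apply to both splits;
# fitting on the full data would leak test-set statistics into training
imputer = SimpleImputer(strategy="mean")
X_train_t = imputer.fit_transform(X_train)
X_test_t = imputer.transform(X_test)
```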

The most common cardinal sins of machine learning that we understood could map to this project were:

Thus we believe that we have not committed any cardinal sins of Machine Learning in this project.

RESULTS AND DISCUSSION OF RESULTS

Here we use the lbfgs solver, a limited-memory BFGS (L-BFGS or LM-BFGS) optimization algorithm in the family of quasi-Newton methods, which approximates the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm using a limited amount of computer memory.

Pipeline+GridSearchCV

Logistic Regression

Hyperparameter tuning

RandomForestClassifier

Pipeline
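A minimal sketch of a Pipeline wrapped in GridSearchCV on synthetic data; the grid shown is illustrative, not the project's actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed HCDR feature matrix
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

pipe = Pipeline([("clf", RandomForestClassifier(random_state=0))])

# Illustrative grid; the real project tuned more parameters per model
param_grid = {"clf__n_estimators": [50, 100], "clf__max_depth": [5, None]}

# ROC AUC as the selection metric, 3-fold cross-validation
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=3)
search.fit(X, y)
```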

Here we see that the Random Forest classifier has the best training accuracy, around 99%, but an accuracy that high runs a risk of overfitting. The Logistic Regression model gives an accuracy of 91%, which is decent and makes it a considerable candidate. The ROC AUC values for the Random Forest and Logistic Regression models are 0.704 and 0.735 respectively, and both show a significant number of true positives, indicating a good fit. We cannot use Naive Bayes as our model, as it evidently underfits the data. We need to check how the Random Forest and Logistic Regression models perform on the test data to confirm whether Random Forest is really overfitting.

For Kaggle submission

In the following pipeline we use a standard scaler to normalize the data to zero mean and unit standard deviation, and use Logistic Regression as our modeling algorithm with the lbfgs solver and up to 1000 iterations for convergence.
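A sketch of this submission pipeline on synthetic data (the real pipeline runs on the preprocessed HCDR features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the final training matrix
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# StandardScaler gives zero mean / unit variance; lbfgs with up to 1000 iterations
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="lbfgs", max_iter=1000)),
])
pipe.fit(X, y)

# Default-probability column, as submitted to Kaggle
proba = pipe.predict_proba(X)[:, 1]
```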

Test Dataset

Here we derive the same final columns for the test dataset as for the training dataset

Neural Network

RESULTS AND DISCUSSION: NEURAL NETWORK


On implementing a neural network to model the HCDR dataset, we obtained a training accuracy of 91.95% and a test ROC score of 0.71225. We used the features created in the previous phase of the project, selecting only the highly correlated features as strong predictors to train the neural network with. After trying different combinations of hidden layers and neuron counts, we found that 4 hidden layers with 128, 64, 32, and 10 neurons respectively worked best, with an output layer of 2 neurons. We tried 3 optimizers, namely AdaDelta, Adam, and RMSprop, and found that Adam worked best for us. Since we did not want to overtrain and overfit the model, we limited training to 1000 epochs, beyond which we found no significant gain in accuracy.
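A minimal PyTorch sketch of this architecture; the input width of 100 and the ReLU activations are assumptions not stated above, and the training step runs on random data purely for illustration:

```python
import torch
import torch.nn as nn

# Four hidden layers (128, 64, 32, 10) and a 2-neuron output, as described above
class MLP(nn.Module):
    def __init__(self, n_features=100):  # input width is illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 10), nn.ReLU(),
            nn.Linear(10, 2),
        )

    def forward(self, x):
        return self.net(x)

model = MLP()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam worked best
criterion = nn.CrossEntropyLoss()

# One illustrative training step on random data
x = torch.randn(16, 100)
y = torch.randint(0, 2, (16,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```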

Here are some TensorBoard visualizations of accuracy and loss during training:

[TensorBoard accuracy and loss curves]

Kaggle Test Accuracy

For Neural Network

[Kaggle submission score screenshot]

For Logistic Regression

[Kaggle submission score screenshot]

For Random Forest Classifier

[Kaggle submission score screenshot]

With a high training accuracy but a low test accuracy, the Random Forest classifier appears to be overfitting, so the ideal choice of model is Logistic Regression.

CONCLUSION

Our project focuses on predicting whether the credit-less population is able to pay back its loans. To make this possible, we source our data from the Home Credit dataset. It is very important that this population also gets a fair chance at obtaining a loan, and as students we can highly relate to this, hence our decision to pursue the project. In the earlier phases, we explored the data through exploratory data analysis and visualization, then pre-processed and cleaned it accordingly. We engineered the features, applied one-hot encoding (OHE), and used imputation methods to fix the data before feeding it to the models. In phase 2 we performed feature engineering by adding new columns for improved accuracy, and performed hyperparameter tuning to find the best settings and parameters for the models, achieving a test ROC score of 0.728 and a training accuracy of 92.4% with the Random Forest classifier. We used Logistic Regression and Random Forest classifier models, and evaluated the results using accuracy score, log loss, confusion matrices, and ROC AUC scores. In this last phase, we implemented a Multilayer Perceptron (MLP) model using PyTorch for loan-default classification; the MLP reached a training accuracy of 91.95% and a test ROC score of 0.71225, quite close to our previous non-deep-learning models. Deep learning models require huge amounts of data to train, so in the long run they would likely work best for HCDR classification compared to the usual supervised models. Future work could include using embeddings in the deep learning models, or using advanced classifiers such as LightGBM and other boosting models that may produce better results.

References: https://www.kaggle.com/c/home-credit-default-risk/data